feat(centroids): add return_area + SpatialData persist to get_centroids#1150
feat(centroids): add return_area + SpatialData persist to get_centroids#1150timtreis wants to merge 9 commits into
Conversation
13bde10 to
a784941
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1150 +/- ##
==========================================
+ Coverage 92.43% 92.50% +0.06%
==========================================
Files 51 51
Lines 7815 7912 +97
==========================================
+ Hits 7224 7319 +95
- Misses 591 593 +2
🚀 New features to boost your workflow:
|
Surface the per-instance area that the labels bincount already computes (and geometric area for shapes), and add a SpatialData dispatch that writes centroids/area into the element's annotating table, squidpy-style. - `return_area` on all element overloads: labels return the pixel/voxel count (free, already computed); shapes return `pi*r**2` for circles and `geometry.area` for polygons; points raise. Carried as a feature column on the returned Points element. - Harmonise the three element overload signatures (they previously disagreed on `return_background`). - New `get_centroids(sdata, element_name, ..., persist_as="adata")` dispatch: resolves the element's annotating table and writes `obsm["spatial"]` (x, y[, z]) and `obs["area"]` at the element's rows, in place. Raises (pointing to `persist_as="Points"`) when no table annotates the element. No standalone AnnData is ever created. - Element-level `persist_as="adata"` raises, directing to the SpatialData form (which can resolve the table). - Fast path: the table-writing path applies the coordinate transform to the tiny per-instance centroid array in-memory and short-circuits when it is an identity, avoiding the dask `transform()` round-trip. `coordinate_system=None` returns intrinsic coordinates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
for more information, see https://pre-commit.ci
…turn type - narrow transformation to BaseTransformation before to_affine_matrix - assert coordinate_system is not None after _validate_persist_args in the element overloads (None is rejected there for persist_as != 'adata') - cast element-dispatch get_centroids results to DaskDataFrame at the two callers (vectorize, datasets); only the SpatialData overload returns SpatialData
…trim docs Correctness: - _write_centroids_into_table: align instance ids via a single get_indexer; raise on a total miss (instance_key dtype != element index) instead of silently writing an all-NaN obsm['spatial']; partial misses (background, filtered instances) still NaN-fill as documented. - refuse to overwrite obsm['spatial'] when an existing array has a different width (would wipe coordinates of other regions sharing the table). - validate coordinate_system on the persist_as='adata' path too. Cleanup (no behavior change): - collapse the three near-identical element overloads into one stacked singledispatch registration delegating to _intrinsic_centroid_frame. - trim verbose docstrings/comments. Tests: instance_key dtype mismatch raises; dimension mismatch refuses to overwrite; invalid-element message updated.
…ds method - persist_as='adata' gains inplace: bool = True (mirrors sanitize_table): inplace=True mutates the resolved table and returns None; inplace=False copies only that AnnData, writes into the copy, returns it (sdata untouched). No whole-SpatialData copy. - add SpatialData.get_centroids(element_name, ...) method delegating to the module function (mirrors sdata.aggregate), so both sd.get_centroids(sdata, ...) and sdata.get_centroids(...) work. Calls the named _get_centroids_sdata overload directly so the full signature type-checks. - widen return type to DaskDataFrame | AnnData | None; update tests.
…edup scatter Correctness hardening (review round 2): - raise a clear error when the element index is non-unique (was a cryptic pandas InvalidIndexError from get_indexer on e.g. duplicate-index shapes). - obsm width-mismatch guard is now 1-D-safe (compare full .shape instead of indexing .shape[1], which raised IndexError on a 1-D obsm['spatial']). - reword the total-miss error (it's 'no shared instances', not only a dtype bug). Cleanup: - reuse the existing _affine_matrix_multiplication helper in _transform_centroid_coords (drops the inline matmul; gives the dead helper a caller). - collapse the duplicated coords/area NaN-fill into a local _scatter(). - _points_from_centroids: df.assign instead of copy-then-mutate. - name the squidpy storage keys (_SPATIAL_KEY/_AREA_KEY) instead of magic strings.
…test assertion - _resolve_annotating_table returns annotators[0] directly (get_element_annotators already yields set[str]; the str() wrap was noise). - extract _assert_obsm_matches_points test helper to remove two near-identical written-obsm vs element-Points comparison blocks.
582f163 to
460ec76
Compare
|
|
||
|
|
||
| def test_get_centroids_sdata_persist_into_table(full_sdata): | ||
| # `table` annotates `labels2d` (instance_id 0..99 == label values); background (0) -> NaN. |
There was a problem hiding this comment.
the comment is confusing, I think here with 0..99 we mean the iloc (the value of instance_id at index 0 is not 0). Better to change the comment.
|
|
||
|
|
||
| def test_get_centroids_sdata_persist_fastpath_matches_transform(full_sdata): | ||
| # the in-memory affine fast path (adata) must equal the dask transform() path (element Points). |
There was a problem hiding this comment.
I see what it means, but still an obscure comment. Also transform() is not a dask method.
| # squidpy-style storage keys for persist_as="adata". | ||
| _SPATIAL_KEY = "spatial" | ||
| _AREA_KEY = "area" |
There was a problem hiding this comment.
This should go in models.py inside TableModel since it's basically an addition to the file format.
| def _transform_centroid_coords( | ||
| xy: np.ndarray, axes: list[str], e: SpatialElement, coordinate_system: str | None | ||
| ) -> np.ndarray: | ||
| """Apply the element's affine to centroid coords in-memory; ``None``/``Identity`` pass through. | ||
|
|
||
| ``axes`` is the column order of ``xy`` (e.g. ``["x", "y"]``). |
There was a problem hiding this comment.
this already works for any axes name and order, but the function makes it looks like it's only working for x and y. Better renaming.
| matrix = t.to_affine_matrix(input_axes=tuple(axes), output_axes=tuple(axes)) | ||
| return _affine_matrix_multiplication(matrix, xy) |
There was a problem hiding this comment.
Probably the matrix multiplication is slower than just a transposition, but if this is a bottleneck we'll find out later during profiling. Let's keep as is.
| The SpatialElement. Only points, shapes (circles, polygons and multipolygons) and labels are supported. | ||
| The SpatialElement (points, shapes — circles, polygons and multipolygons — or labels), or a | ||
| :class:`~spatialdata.SpatialData` object. When a ``SpatialData`` is passed, the second | ||
| positional argument is the name of the element to measure (see the ``SpatialData`` overload). |
There was a problem hiding this comment.
Not a big fan of this (positional argument depending on the single dispatch argument), but it works.
| return_area | ||
| If True, also return the per-instance area: the pixel/voxel count for labels and the geometric | ||
| area for shapes (``pi * r**2`` for circles). Not supported for points (raises). With | ||
| ``persist_as="Points"`` the area is added as a feature column of the returned Points element. |
There was a problem hiding this comment.
There is a design problem: the area is always returned untransformed (=in the intrinsic coordinate system). I think we should be consistent: if we transform the centroids (and we do this) we should transform the area. Otherwise we should not transform the centroids (or we should not return the area unless we have an identity/intrinsic).
I like to return the area, so I would consider always requiring coordinate_system=None (or mapped to a system where there is an identity) when we also compute the area.
There was a problem hiding this comment.
This requires to make a decision regarding the design, I don't have a solution in mind.
| if return_area: | ||
| raise ValueError("`return_area` is not supported for points elements (points have no area).") | ||
| axes = get_axes_names(element) | ||
| assert axes in [("x", "y"), ("x", "y", "z")] |
There was a problem hiding this comment.
we should turn this assert into a if raises
| get_centroids(full_sdata, "labels2d", persist_as="adata") | ||
|
|
||
|
|
||
| def test_get_centroids_sdata_persist_refuses_dim_mismatch(full_sdata): |
There was a problem hiding this comment.
This edge case only occurs when a table is annotating multiple elements, and those elements have different axes (i.e. xyz vs xy). I would make this clear. Note that the table could still be annotating some elements having xy axes, and some elements having yx axes. I will make this comment also elsewhere. (edit: there is no xy <> yx ambiguity, the ordering is always x, y[, z]).
There was a problem hiding this comment.
Action here: add a comment in the test explaining when this occurs.
| return annotators[0] | ||
|
|
||
|
|
||
| def _write_centroids_into_table( |
There was a problem hiding this comment.
This function seems to do the job, but I think there is a big design problem. Nothing stops the user in having a table annotating multiple elmements and where:
is annotating some elements with xy axes and some elements with yx axes -> the information of axes is lost once the centroids are in obsm- the get_centroids function gets called on different coordinate systems for different elements -> the information of which coordinate systems the centroids belong to is lost once centroids are in obsm.
I have some ideas but not a clear solution in mind. This requires some work into definiting how the Spatial AnnData specification works.
There was a problem hiding this comment.
Actually the first is not a problem since the returned centroids are always x, y[, z] (due to the assertion above)
axes = get_axes_names(element)
assert axes in [("x", "y"), ("x", "y", "z")]
There was a problem hiding this comment.
The second point requires a design decision.
| _validate_coordinate_system(element, coordinate_system) | ||
| table_name = _resolve_annotating_table(e, element_name, table_name) | ||
| df, area, raster = _intrinsic_centroid_frame(element, return_background, return_area) | ||
| coord_cols = sorted(df.columns) # canonical x, y[, z] (squidpy obsm["spatial"] order) |
There was a problem hiding this comment.
Important point to put in the docstring (and later in the specs).
There was a problem hiding this comment.
(that the order is always x, y[, z])
| return_background: bool = False, | ||
| return_area: bool = False, | ||
| persist_as: Literal["Points", "adata"] = "Points", | ||
| table_name: str | None = None, |
There was a problem hiding this comment.
if we pass a table name for a table that does not exist, we could createa a table annotating the element. WDYT?
for more information, see https://pre-commit.ci
|
Thanks @timtreis, the code looks good! Most of my comments are minor and they can be easily addressed (probably one-shot with an agent). But there are 2 discussion items pointing to a design problem that we need to resolve. Happy to hear your thoughts on them! |
Motivation
get_centroidsalready computes per-label pixel counts (np.bincount) when finding label centroids, then discards them, we could store asarea. This PR surfaces that area into an element's annotating table, squidpy-style (obsm["spatial"]/obs["area"]), so downstream tools (e.g. squidpy) and repeated renders can reuse them instead of recomputing.stacked on #1151